Translation as Linear Transduction Models and Algorithms for Efficient Learning in Statistical Machine Translation
نویسنده
چکیده
Saers, M. 2011. Translation as Linear Transduction. Models and Algorithms for Efficient Learning in Statistical Machine Translation. Acta Universitatis Upsaliensis. Studia Linguistica Upsaliensia 9. 133 pp. Uppsala. ISBN 978-91-554-7976-3. Automatic translation has seen tremendous progress in recent years, mainly thanks to statistical methods applied to large parallel corpora. Transductions represent a principled approach to modeling translation, but existing transduction classes are either not expressive enough to capture structural regularities between natural languages or too complex to support efficient statistical induction on a large scale. A common approach is to severely prune search over a relatively unrestricted space of transduction grammars. These restrictions are often applied at different stages in a pipeline, with the obvious drawback of committing to irrevocable decisions that should not have been made. In this thesis we will instead restrict the space of transduction grammars to a space that is less expressive, but can be efficiently searched. First, the class of linear transductions is defined and characterized. They are generated by linear transduction grammars, which represent the natural bilingual case of linear grammars, as well as the natural linear case of inversion transduction grammars (and higher order syntaxdirected transduction grammars). They are recognized by zipper finite-state transducers, which are equivalent to finite-state automata with four tapes. By allowing this extra dimensionality, linear transductions can represent alignments that finite-state transductions cannot, and by keeping the mechanism free of auxiliary storage, they become much more efficient than inversion transductions. Secondly, we present an algorithm for parsing with linear transduction grammars that allows pruning. The pruning scheme imposes no restrictions a priori, but guides the search to potentially interesting parts of the search space in an informed and dynamic way. Being able to parse efficiently allows learning of stochastic linear transduction grammars through expectation maximization. All the above work would be for naught if linear transductions were too poor a reflection of the actual transduction between natural languages. We test this empirically by building systems based on the alignments imposed by the learned grammars. The conclusion is that stochastic linear inversion transduction grammars learned from observed data stand up well to the state of the art.
منابع مشابه
An Efficient Shift-Reduce Decoding Algorithm for Phrased-Based Machine Translation
In statistical machine translation, decoding without any reordering constraint is an NP-hard problem. Inversion Transduction Grammars (ITGs) exploit linguistic structure and can well balance the needed flexibility against complexity constraints. Currently, translation models with ITG constraints usually employs the cube-time CYK algorithm. In this paper, we present a shift-reduce decoding algor...
متن کاملStatistical Machine Translation Part II: Tree-Based SMT
One of the most active and promising areas of statistical machine translation (SMT) research are tree-based SMT approaches. Tree-based SMT has the potential to overcome the weaknesses of early SMT architectures which (a) do not handle long-distance dependencies well, and (b) are underconstrained in that they allow too much flexibility in word reordering. In this tutorial, we will review the var...
متن کاملA new model for persian multi-part words edition based on statistical machine translation
Multi-part words in English language are hyphenated and hyphen is used to separate different parts. Persian language consists of multi-part words as well. Based on Persian morphology, half-space character is needed to separate parts of multi-part words where in many cases people incorrectly use space character instead of half-space character. This common incorrectly use of space leads to some s...
متن کاملMachine learning algorithms in air quality modeling
Modern studies in the field of environment science and engineering show that deterministic models struggle to capture the relationship between the concentration of atmospheric pollutants and their emission sources. The recent advances in statistical modeling based on machine learning approaches have emerged as solution to tackle these issues. It is a fact that, input variable type largely affec...
متن کاملLinear Transduction Grammars and Zipper Finite-State Transducers
We examine how the recently explored class of linear transductions relates to finite-state models. Linear transductions have been neglected historically, but gainined recent interest in statistical machine translation modeling, due to empirical studies demonstrating that their attractive balance of generative capacity and complexity characteristics lead to improved accuracy and speed in learnin...
متن کامل